FIGURE 5.1: Performance of quantized BERT with varying weight bit-widths and 8-bit activations on MRPC and MNLI-m.

5.2 Fully Quantized Transformer for Machine Translation

Prato et al. introduce FullyQT, an all-inclusive quantization strategy for the Transformer, and are the first to show that a fully quantized Transformer can avoid any loss in translation quality [190]. Their method consists of four parts: the quantization scheme, the choice of which layers to quantize, tensor bucketing, and a dedicated treatment of zeros.

5.2.1 Quantization Scheme

The quantization scheme is uniform, meaning that the step size between two consecutive quantized values is constant. This additional constraint was chosen for practical reasons: it simplifies the computations required during inference and allows hardware resources to be exploited more efficiently. Given an element $x$ of a tensor $X$, the uniform quantization scheme is defined as:

$$Q(x) = \left\lfloor \frac{\operatorname{clamp}(x;\, x_{\min}, x_{\max}) - x_{\min}}{s} \right\rceil, \qquad (5.7)$$

where $x_{\min}$ and $x_{\max}$ define the endpoints of the quantization interval. The clamp function maps all values outside the $[x_{\min}, x_{\max}]$ range to the closest endpoint, and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer.

The step size s is computed by:

$$s = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \qquad (5.8)$$

where b is simply the bit precision.
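
As a rough illustration of Eqs. (5.7) and (5.8), the following NumPy sketch quantizes a tensor to $b$-bit integer codes and maps the codes back to real values. The function names `uniform_quantize` and `dequantize` are illustrative and not taken from the FullyQT implementation, and the de-quantization step is a standard inverse mapping, not something spelled out in this section.

```python
import numpy as np

def uniform_quantize(x, x_min, x_max, b):
    """Uniform quantization following Eqs. (5.7)-(5.8) (illustrative sketch)."""
    # Step size: s = (x_max - x_min) / (2^b - 1), Eq. (5.8)
    s = (x_max - x_min) / (2 ** b - 1)
    # Clamp to [x_min, x_max], shift by x_min, scale by s,
    # then round to the nearest integer, Eq. (5.7)
    q = np.rint((np.clip(x, x_min, x_max) - x_min) / s)
    return q, s

def dequantize(q, s, x_min):
    """Map integer codes back to approximate real values (inverse of Eq. (5.7))."""
    return q * s + x_min

# Example: 8-bit quantization of a weight tensor, using the tensor's
# own min and max as the interval endpoints (the choice used for weights below).
W = np.random.randn(4, 4).astype(np.float32)
q, s = uniform_quantize(W, W.min(), W.max(), b=8)
W_hat = dequantize(q, s, W.min())
```

The reconstruction `W_hat` differs from `W` only by the rounding error introduced in Eq. (5.7); it is shown here merely to make that error visible.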

When quantization is applied to weights, $x_{\min}$ and $x_{\max}$ are respectively $\min(X)$ and $\max(X)$. However, when quantization is applied to activations, those values are running